Data Description

The dataset contains transactions made by credit cards in September 2013 by European cardholders.

This dataset presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions.

The dataset is highly imbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

It contains only numerical input variables which are the result of a PCA transformation.

Unfortunately, due to confidentiality issues, we cannot provide the original features and more background information about the data.

Features V1, V2, ... V28 are the principal components obtained with PCA; the only features which have not been transformed with PCA are 'Time' and 'Amount'.

Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning.
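One common way to exploit 'Amount' for example-dependent cost-sensitive learning is to weight each transaction by its monetary value, so misclassifying a large transaction costs more than a small one. A minimal sketch on toy data (the `sample_weight` mechanism shown is scikit-learn's; the features and amounts here are made up, not the project data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                   # toy PCA-like features
y = (X[:, 0] + rng.normal(size=200) > 1).astype(int)
amount = rng.exponential(scale=100, size=200)   # toy transaction amounts

# weight each example by its amount: errors on big transactions cost more
model = LogisticRegression().fit(X, y, sample_weight=amount)
print(model.score(X, y))
```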

Feature 'Class' is the response variable; it takes value 1 in case of fraud and 0 otherwise.

Business Problem

Task    : Detect the fraudulent activities.
Metric  : Recall
Sampling: No sampling; use all the data.
Tools   : Python module PyCaret for classification.
Question: How many frauds are correctly classified?
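Since recall is the chosen metric, it helps to see exactly how it comes out of a confusion matrix. A minimal sketch on toy labels (1 = fraud), not the project data:

```python
from sklearn.metrics import confusion_matrix, recall_score

# toy labels: 1 = fraud, 0 = legitimate (illustration only)
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

# recall = TP / (TP + FN): the fraction of true frauds that were caught
print(recall_score(y_true, y_pred))  # 0.75
print(tp / (tp + fn))                # 0.75
```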

Introduction to Boosting

The term Boosting refers to a family of algorithms that convert weak learners into strong learners.

There are many boosting algorithms:

sklearn.ensemble.GradientBoostingRegressor
xgboost.XGBRegressor # fast and accurate
lightgbm.LGBMRegressor # extremely fast, slightly lower accuracy than xgb
catboost.CatBoostRegressor # good for categorical features
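The idea behind all of these is the same: each new weak learner is fit to correct the mistakes of the ensemble built so far. A minimal illustration with scikit-learn's GradientBoostingClassifier on toy data, where training accuracy improves as boosting rounds are added:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, random_state=0)

# each stage fits a small tree to the errors of the previous stages
model = GradientBoostingClassifier(n_estimators=50, max_depth=2, random_state=0)
model.fit(X, y)

# staged_predict yields predictions after 1, 2, ..., 50 boosting rounds
scores = [accuracy_score(y, yp) for yp in model.staged_predict(X)]
print(scores[0], scores[-1])  # training accuracy rises with boosting rounds
```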

Imports

In [1]:
import time

notebook_start_time = time.time()
In [2]:
import sys
ENV_BHISHAN = None

try:
    import bhishan
    print('Environment: Personal environment')
    ENV_BHISHAN = True
    %load_ext autoreload
    %autoreload 2
except ImportError:
    print('Module "bhishan" not found.')
Environment: Personal environment
In [3]:
import sys
ENV_COLAB = 'google.colab' in sys.modules

if ENV_COLAB:
    #!pip install hpsklearn
    !pip install shap eli5
    !pip install catboost
    !pip install ipywidgets
    !jupyter nbextension enable --py widgetsnbextension

    # set OMP_NUM_THREADS=1 for hpsklearn package
    #!export OMP_NUM_THREADS=1
    print('Environment: Google Colab')
In [4]:
import numpy as np
import pandas as pd

SEED = 100

import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = 8,8
plt.rcParams.update({'font.size': 16})

plt.style.use('ggplot')
%matplotlib inline

import seaborn as sns
sns.set(color_codes=True)
In [7]:
# boosting
import xgboost as xgb
import lightgbm as lgb
import catboost

from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBClassifier, DMatrix
from lightgbm import LGBMClassifier, Dataset
from catboost import CatBoostClassifier, Pool, CatBoost

print([(x.__name__,x.__version__) for x in [xgb, lgb,catboost]])
[('xgboost', '0.80'), ('lightgbm', '2.3.1'), ('catboost', '0.20.2')]
In [8]:
# six and pickle
import six
import pickle
import joblib
In [9]:
# scale and split
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
In [10]:
# classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
In [11]:
# sklearn scalar metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
In [12]:
# roc auc and curves
from sklearn.metrics import auc
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.metrics import precision_recall_curve
In [13]:
# confusion matrix and classification report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
In [14]:
import time
In [15]:
from hyperopt import hp, tpe, fmin, Trials, STATUS_OK, STATUS_FAIL
from hyperopt.pyll import scope
from hyperopt.pyll.stochastic import sample
import copy
import pprint
pp = pprint.PrettyPrinter(indent=4)
In [16]:
# model intepretation modules
import eli5
import shap
# shap_values = shap.TreeExplainer(model_xgb).shap_values(Xtest)
# shap.summary_plot(shap_values, Xtest)
# shap.dependence_plot("column_name", shap_values, Xtest)

Useful Scripts

In [17]:
def show_method_attributes(obj, ncols=7,start=None, inside=None):
    """ Show all the attributes of a given method.
    Example:
    ========
    show_method_attributes(list)
     """
    lst = [elem for elem in dir(obj) if elem[0]!='_' ]
    lst = [elem for elem in lst 
           if elem not in 'os np pd sys time psycopg2'.split() ]

    if isinstance(start,str):
        lst = [elem for elem in lst if elem.startswith(start)]
        
    if isinstance(start,tuple) or isinstance(start,list):
        lst = [elem for elem in lst for start_elem in start
               if elem.startswith(start_elem)]
        
    if isinstance(inside,str):
        lst = [elem for elem in lst if inside in elem]
        
    if isinstance(inside,tuple) or isinstance(inside,list):
        lst = [elem for elem in lst for inside_elem in inside
               if inside_elem in elem]

    return pd.DataFrame(np.array_split(lst,ncols)).T.fillna('')
In [18]:
df_eval = pd.DataFrame({'Model': [],
                        'Description':[],
                        'Accuracy':[],
                        'Precision':[],
                        'Recall':[],
                        'F1':[],
                        'AUC':[],
                    })

Load the data

In [19]:
ifile = 'https://github.com/bhishanpdl/Project_Fraud_Detection/blob/master/data/raw/creditcard.csv.zip?raw=true'
df = pd.read_csv(ifile,compression='zip')
print(df.shape)
df.head()
(284807, 31)
Out[19]:
Time V1 V2 V3 V4 V5 V6 V7 V8 V9 ... V21 V22 V23 V24 V25 V26 V27 V28 Amount Class
0 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599 0.098698 0.363787 ... -0.018307 0.277838 -0.110474 0.066928 0.128539 -0.189115 0.133558 -0.021053 149.62 0
1 0.0 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803 0.085102 -0.255425 ... -0.225775 -0.638672 0.101288 -0.339846 0.167170 0.125895 -0.008983 0.014724 2.69 0
2 1.0 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461 0.247676 -1.514654 ... 0.247998 0.771679 0.909412 -0.689281 -0.327642 -0.139097 -0.055353 -0.059752 378.66 0
3 1.0 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609 0.377436 -1.387024 ... -0.108300 0.005274 -0.190321 -1.175575 0.647376 -0.221929 0.062723 0.061458 123.50 0
4 2.0 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 0.592941 -0.270533 0.817739 ... -0.009431 0.798278 -0.137458 0.141267 -0.206010 0.502292 0.219422 0.215153 69.99 0

5 rows × 31 columns

In [20]:
target = 'Class'
features = df.columns.drop(target)
df[target].value_counts(normalize=True)*100
Out[20]:
0    99.827251
1     0.172749
Name: Class, dtype: float64

Train test split with stratify

In [21]:
from sklearn.model_selection import train_test_split

df_Xtrain_orig, df_Xtest, ser_ytrain_orig, ser_ytest = train_test_split(
    df.drop(target,axis=1), 
    df[target],
    test_size=0.2, 
    random_state=SEED, 
    stratify=df[target])

ytrain_orig = ser_ytrain_orig.to_numpy().ravel()
ytest = ser_ytest.to_numpy().ravel()

print(df_Xtrain_orig.shape)
df_Xtrain_orig.head()
(227845, 30)
Out[21]:
Time V1 V2 V3 V4 V5 V6 V7 V8 V9 ... V20 V21 V22 V23 V24 V25 V26 V27 V28 Amount
211885 138616.0 -1.137612 2.345154 -1.767247 0.833982 0.973168 -0.073571 0.802433 0.733137 -1.154087 ... 0.062820 0.114953 0.430613 -0.240819 0.124011 0.187187 -0.402251 0.196277 0.190732 39.46
12542 21953.0 -1.028649 1.141569 2.492561 -0.242233 0.452842 -0.384273 1.256026 -0.816401 1.964560 ... 0.350032 -0.380356 -0.037432 -0.503934 0.407129 0.604252 0.233015 -0.433132 -0.491892 7.19
270932 164333.0 -1.121864 -0.195099 1.282634 -3.172847 -0.761969 -0.287013 -0.586367 0.496182 -2.352349 ... -0.113632 -0.328953 -0.856937 -0.056198 0.401905 0.406813 -0.440140 0.152356 0.030128 40.00
30330 35874.0 1.094238 -0.760568 -0.392822 -0.611720 -0.722850 -0.851978 -0.185505 -0.095131 -1.122304 ... 0.354148 -0.227392 -1.254285 0.022116 -0.141531 0.114515 -0.652427 -0.037897 0.051254 165.85
272477 165107.0 2.278095 -1.298924 -1.884035 -1.530435 -0.649500 -0.996024 -0.466776 -0.438025 -1.612665 ... -0.341708 0.123892 0.815909 -0.072537 0.784217 0.403428 0.193747 -0.043185 -0.058719 60.00

5 rows × 30 columns

Train Validation with stratify

In [22]:
df_Xtrain, df_Xvalid, ser_ytrain, ser_yvalid = train_test_split(
    df_Xtrain_orig, 
    ser_ytrain_orig,
    test_size=0.2, 
    random_state=SEED, 
    stratify=ser_ytrain_orig)


ytrain = ser_ytrain.to_numpy().ravel()
yvalid = ser_yvalid.to_numpy().ravel()

print(df_Xtrain.shape)
(182276, 30)
In [23]:
# random undersampling
n = df[target].value_counts().values[-1]
df_under = (df.groupby(target)
                .apply(lambda x: x.sample(n,random_state=SEED))
                .reset_index(drop=True))

df_Xtrain_orig_under, df_Xtest_under, ser_ytrain_orig_under, ser_ytest_under = train_test_split(
    df_under.drop(target,axis=1),
    df_under[target],
    test_size=0.2, 
    random_state=SEED, 
    stratify=df_under[target])

df_Xtrain_under, df_Xvalid_under, ser_ytrain_under, ser_yvalid_under = train_test_split(
    df_Xtrain_orig_under,
    ser_ytrain_orig_under,
    test_size=0.2, 
    random_state=SEED, 
    stratify=ser_ytrain_orig_under)

ser_ytrain.value_counts(), ser_ytest.value_counts(), ser_yvalid.value_counts()
Out[23]:
(0    181961
 1       315
 Name: Class, dtype: int64, 0    56864
 1       98
 Name: Class, dtype: int64, 0    45490
 1       79
 Name: Class, dtype: int64)

Modelling catboost

https://catboost.ai/docs/concepts/python-reference_catboostregressor.html

class CatBoostRegressor(iterations=None,learning_rate=None,depth=None,
l2_leaf_reg=None,model_size_reg=None,rsm=None,loss_function='RMSE',
border_count=None,feature_border_type=None,
per_float_feature_quantization=None,input_borders=None,
output_borders=None,fold_permutation_block=None,od_pval=None,
od_wait=None,od_type=None,nan_mode=None,counter_calc_method=None,
leaf_estimation_iterations=None,leaf_estimation_method=None,
thread_count=None,random_seed=None,use_best_model=None,
best_model_min_trees=None,verbose=None,silent=None,logging_level=None,
metric_period=None,ctr_leaf_count_limit=None,store_all_simple_ctr=None,
max_ctr_complexity=None,
has_time=None,allow_const_label=None,one_hot_max_size=None,
random_strength=None,name=None,ignored_features=None,
train_dir=None,custom_metric=None,eval_metric=None,
bagging_temperature=None,save_snapshot=None,
snapshot_file=None,snapshot_interval=None,
fold_len_multiplier=None,used_ram_limit=None,gpu_ram_part=None,
pinned_memory_size=None,allow_writing_files=None,
final_ctr_computation_mode=None,approx_on_full_history=None,
boosting_type=None,simple_ctr=None,combinations_ctr=None,
per_feature_ctr=None,ctr_target_border_count=None,task_type=None,
device_config=None,devices=None,bootstrap_type=None,subsample=None,
sampling_unit=None,dev_score_calc_obj_block_size=None,
max_depth=None,n_estimators=None,num_boost_round=None,
num_trees=None,colsample_bylevel=None,random_state=None,
reg_lambda=None,objective=None,eta=None,max_bin=None,
gpu_cat_features_storage=None,data_partition=None,
metadata=None,early_stopping_rounds=None,cat_features=None,
grow_policy=None,min_data_in_leaf=None,min_child_samples=None,
max_leaves=None,num_leaves=None,score_function=None,
leaf_estimation_backtracking=None,ctr_history_unit=None,
monotone_constraints=None)
In [24]:
import catboost
show_method_attributes(catboost)
Out[24]:
0 1 2 3 4 5 6
0 CatBoost CatBoostRegressor EFstrType MetricVisualizer core sum_models version
1 CatBoostClassifier CatboostError FeaturesData Pool cv train widget
2 CatBoostError
In [25]:
from catboost import CatBoostClassifier, Pool

show_method_attributes(CatBoostClassifier)
Out[25]:
0 1 2 3 4 5 6
0 best_iteration_ drop_unused_features get_best_score get_object_importance is_fitted predict_proba set_leaf_values
1 best_score_ eval_metrics get_borders get_param iterate_leaf_indexes random_seed_ set_params
2 calc_feature_statistics evals_result_ get_cat_feature_indices get_params learning_rate_ randomized_search shrink
3 calc_leaf_indexes feature_importances_ get_evals_result get_test_eval load_model save_borders staged_predict
4 classes_ feature_names_ get_feature_importance get_test_evals plot_predictions save_model staged_predict_log_proba
5 compare fit get_leaf_values get_text_feature_indices plot_tree score staged_predict_proba
6 copy get_all_params get_leaf_weights get_tree_leaf_counts predict set_feature_names tree_count_
7 create_metric_calcer get_best_iteration get_metadata grid_search predict_log_proba
In [26]:
from catboost import CatBoostClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import accuracy_score,  precision_score, recall_score,f1_score
from sklearn.metrics import confusion_matrix

# time
time_start = time.time()

# current parameters
desc = 'default,random_state=100, numpy'

Xtr = df_Xtrain.to_numpy()
ytr = ser_ytrain.to_numpy().ravel()
Xtx = df_Xtest.to_numpy()
ytx = ser_ytest.to_numpy().ravel()


# fit the model
model_cat = CatBoostClassifier(verbose=100,random_state=SEED)

model_cat.fit(Xtr, ytr)

# fitted model
model = model_cat


# save the model
# joblib.dump(model_cat, 'model_cat.pkl')
# model_cat = joblib.load('model_cat.pkl')


# predictions
skf = StratifiedKFold(n_splits=2,shuffle=True,random_state=SEED)
ypreds_cv = cross_val_predict(model_cat, Xtx, ytx, cv=skf)
ypreds = ypreds_cv



# model evaluation
row_eval = ['catboost','default, seed=100', 
            accuracy_score(ytx, ypreds),
            precision_score(ytx, ypreds, average='micro'),
            recall_score(ytx, ypreds, average='micro'),
            f1_score(ytx, ypreds, average='micro'),
            roc_auc_score(ytx, ypreds),
       ]

df_eval.loc[len(df_eval)] = row_eval
df_eval = df_eval.drop_duplicates()
time_taken = time.time() - time_start
print('Time taken: {:.0f} min {:.0f} secs'.format(*divmod(time_taken,60)))
display(df_eval)
Learning rate set to 0.073099
0:	learn: 0.4520944	total: 117ms	remaining: 1m 57s
100:	learn: 0.0016060	total: 5.89s	remaining: 52.4s
200:	learn: 0.0011228	total: 10.8s	remaining: 42.9s
300:	learn: 0.0008612	total: 15.7s	remaining: 36.4s
400:	learn: 0.0006628	total: 20.5s	remaining: 30.6s
500:	learn: 0.0005038	total: 25.4s	remaining: 25.3s
600:	learn: 0.0003722	total: 30.6s	remaining: 20.3s
700:	learn: 0.0002731	total: 35.4s	remaining: 15.1s
800:	learn: 0.0001932	total: 40.3s	remaining: 10s
900:	learn: 0.0001492	total: 45.5s	remaining: 5s
999:	learn: 0.0001250	total: 50.3s	remaining: 0us
Learning rate set to 0.043228
0:	learn: 0.5620845	total: 36.9ms	remaining: 36.9s
100:	learn: 0.0016339	total: 1.71s	remaining: 15.2s
200:	learn: 0.0008008	total: 3.43s	remaining: 13.6s
300:	learn: 0.0004417	total: 5.1s	remaining: 11.8s
400:	learn: 0.0002461	total: 6.88s	remaining: 10.3s
500:	learn: 0.0001709	total: 8.67s	remaining: 8.64s
600:	learn: 0.0001256	total: 9.74s	remaining: 6.46s
700:	learn: 0.0000987	total: 10.8s	remaining: 4.58s
800:	learn: 0.0000809	total: 11.8s	remaining: 2.92s
900:	learn: 0.0000679	total: 12.8s	remaining: 1.4s
999:	learn: 0.0000582	total: 13.9s	remaining: 0us
Learning rate set to 0.043228
0:	learn: 0.5632700	total: 14.2ms	remaining: 14.2s
100:	learn: 0.0020027	total: 1.08s	remaining: 9.6s
200:	learn: 0.0008517	total: 2.19s	remaining: 8.71s
300:	learn: 0.0004443	total: 3.24s	remaining: 7.52s
400:	learn: 0.0002635	total: 4.31s	remaining: 6.44s
500:	learn: 0.0001938	total: 5.33s	remaining: 5.31s
600:	learn: 0.0001567	total: 6.36s	remaining: 4.22s
700:	learn: 0.0001340	total: 7.37s	remaining: 3.14s
800:	learn: 0.0001134	total: 8.39s	remaining: 2.08s
900:	learn: 0.0000962	total: 9.45s	remaining: 1.04s
999:	learn: 0.0000822	total: 10.5s	remaining: 0us
Time taken: 1 min 16 secs
Model Description Accuracy Precision Recall F1 AUC
0 catboost default, seed=100 0.999403 0.999403 0.999403 0.999403 0.85709
In [27]:
# calculate the FPR and TPR for all thresholds of the classification

from sklearn import metrics

yprobs = model_cat.predict_proba(df_Xtest)
ypreds = yprobs[:,1]

fpr, tpr, threshold = metrics.roc_curve(ytest, ypreds)
roc_auc = metrics.auc(fpr, tpr)

plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'red', label = 'ROC AUC score = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'b--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
In [28]:
import eli5

# eli5.explain_weights_catboost(model_cat) # same thing
eli5.show_weights(model_cat)
Out[28]:
Weight Feature
0.0770 4
0.0733 14
0.0709 1
0.0606 29
0.0537 8
0.0493 0
0.0479 9
0.0453 12
0.0443 26
0.0396 24
0.0307 19
0.0294 15
0.0290 10
0.0276 25
0.0275 16
0.0267 6
0.0254 7
0.0249 2
0.0239 13
0.0223 11
… 10 more …

Catboost with validation set

In [29]:
df_Xtrain.head(2)
Out[29]:
Time V1 V2 V3 V4 V5 V6 V7 V8 V9 ... V20 V21 V22 V23 V24 V25 V26 V27 V28 Amount
35574 38177.0 1.430419 -0.718078 0.364706 -0.744257 -0.556090 0.698948 -0.949852 0.131008 -0.314353 ... 0.158424 0.042013 0.429576 -0.301931 -0.933773 0.840490 -0.027776 0.044688 -0.007522 0.2
46862 42959.0 -2.425523 -1.790293 2.522139 0.581141 0.918453 0.594426 0.224541 0.373885 -0.168411 ... 0.984535 0.538438 0.877560 0.590595 -0.293545 0.524022 -0.328189 -0.205285 -0.109163 300.0

2 rows × 30 columns

In [30]:
# time
time_start = time.time()

# current parameters
Xtr = df_Xtrain
ytr = ser_ytrain.to_numpy().ravel()
Xtx = df_Xtest
ytx = ser_ytest.to_numpy().ravel()
Xvd = df_Xvalid
yvd = ser_yvalid.to_numpy().ravel()


# fit the model
model = CatBoostClassifier(random_state=0,verbose=100)
model.fit(Xtr, ytr,
          eval_set=(Xvd, yvd))


# ypreds
skf=StratifiedKFold(n_splits=5,shuffle=True,random_state=SEED)
ypreds = cross_val_predict(model, Xtx, ytx, cv=skf)

# roc auc score
r = roc_auc_score(ytx, ypreds)

# time
time_taken = time.time() - time_start
print('Time taken: {:.0f} min {:.0f} secs'.format(*divmod(time_taken,60)))

print('ROC AUC Score ', r)
Learning rate set to 0.134781
0:	learn: 0.3027203	test: 0.3029591	best: 0.3029591 (0)	total: 43ms	remaining: 43s
100:	learn: 0.0011549	test: 0.0026406	best: 0.0026244 (85)	total: 3.3s	remaining: 29.3s
200:	learn: 0.0007232	test: 0.0026344	best: 0.0026101 (153)	total: 6.44s	remaining: 25.6s
300:	learn: 0.0004305	test: 0.0027123	best: 0.0026101 (153)	total: 9.61s	remaining: 22.3s
400:	learn: 0.0002653	test: 0.0027696	best: 0.0026101 (153)	total: 12.8s	remaining: 19.1s
500:	learn: 0.0001623	test: 0.0028454	best: 0.0026101 (153)	total: 16s	remaining: 15.9s
600:	learn: 0.0001167	test: 0.0028703	best: 0.0026101 (153)	total: 19.1s	remaining: 12.7s
700:	learn: 0.0000867	test: 0.0029258	best: 0.0026101 (153)	total: 22.3s	remaining: 9.51s
800:	learn: 0.0000702	test: 0.0029784	best: 0.0026101 (153)	total: 25.4s	remaining: 6.3s
900:	learn: 0.0000601	test: 0.0030353	best: 0.0026101 (153)	total: 28.4s	remaining: 3.12s
999:	learn: 0.0000533	test: 0.0030564	best: 0.0026101 (153)	total: 31.4s	remaining: 0us

bestTest = 0.002610121031
bestIteration = 153

Shrink model to first 154 iterations.
Learning rate set to 0.049377
0:	learn: 0.5304201	total: 16.6ms	remaining: 16.5s
100:	learn: 0.0021255	total: 1.35s	remaining: 12s
200:	learn: 0.0011834	total: 2.68s	remaining: 10.7s
300:	learn: 0.0007844	total: 3.96s	remaining: 9.2s
400:	learn: 0.0004942	total: 5.37s	remaining: 8.02s
500:	learn: 0.0003041	total: 6.68s	remaining: 6.65s
600:	learn: 0.0002271	total: 7.95s	remaining: 5.28s
700:	learn: 0.0001848	total: 9.25s	remaining: 3.94s
800:	learn: 0.0001549	total: 10.5s	remaining: 2.62s
900:	learn: 0.0001350	total: 11.9s	remaining: 1.3s
999:	learn: 0.0001155	total: 13.1s	remaining: 0us
Learning rate set to 0.049377
0:	learn: 0.5255427	total: 17ms	remaining: 16.9s
100:	learn: 0.0014841	total: 1.37s	remaining: 12.2s
200:	learn: 0.0007662	total: 2.85s	remaining: 11.3s
300:	learn: 0.0004822	total: 4.21s	remaining: 9.77s
400:	learn: 0.0003400	total: 5.52s	remaining: 8.25s
500:	learn: 0.0002166	total: 6.85s	remaining: 6.82s
600:	learn: 0.0001520	total: 8.14s	remaining: 5.4s
700:	learn: 0.0001203	total: 9.43s	remaining: 4.02s
800:	learn: 0.0001001	total: 10.7s	remaining: 2.66s
900:	learn: 0.0000853	total: 12s	remaining: 1.32s
999:	learn: 0.0000741	total: 13.3s	remaining: 0us
Learning rate set to 0.049377
0:	learn: 0.5286848	total: 16.9ms	remaining: 16.9s
100:	learn: 0.0013962	total: 1.3s	remaining: 11.6s
200:	learn: 0.0007208	total: 2.61s	remaining: 10.4s
300:	learn: 0.0004411	total: 3.92s	remaining: 9.11s
400:	learn: 0.0002810	total: 5.21s	remaining: 7.78s
500:	learn: 0.0001784	total: 6.49s	remaining: 6.46s
600:	learn: 0.0001305	total: 7.79s	remaining: 5.17s
700:	learn: 0.0001020	total: 9.13s	remaining: 3.89s
800:	learn: 0.0000845	total: 10.4s	remaining: 2.59s
900:	learn: 0.0000725	total: 11.7s	remaining: 1.29s
999:	learn: 0.0000624	total: 13s	remaining: 0us
Learning rate set to 0.049378
0:	learn: 0.5295792	total: 15.9ms	remaining: 15.9s
100:	learn: 0.0019820	total: 1.31s	remaining: 11.7s
200:	learn: 0.0010341	total: 2.68s	remaining: 10.7s
300:	learn: 0.0006729	total: 3.98s	remaining: 9.24s
400:	learn: 0.0004194	total: 5.31s	remaining: 7.94s
500:	learn: 0.0002840	total: 6.61s	remaining: 6.58s
600:	learn: 0.0002062	total: 7.92s	remaining: 5.26s
700:	learn: 0.0001594	total: 9.21s	remaining: 3.93s
800:	learn: 0.0001277	total: 10.5s	remaining: 2.62s
900:	learn: 0.0001099	total: 11.8s	remaining: 1.3s
999:	learn: 0.0000959	total: 13.3s	remaining: 0us
Learning rate set to 0.049378
0:	learn: 0.5297836	total: 19.3ms	remaining: 19.3s
100:	learn: 0.0022455	total: 1.33s	remaining: 11.9s
200:	learn: 0.0012829	total: 2.68s	remaining: 10.7s
300:	learn: 0.0008490	total: 3.99s	remaining: 9.26s
400:	learn: 0.0004834	total: 5.28s	remaining: 7.88s
500:	learn: 0.0003394	total: 6.61s	remaining: 6.58s
600:	learn: 0.0002504	total: 7.92s	remaining: 5.26s
700:	learn: 0.0001931	total: 9.24s	remaining: 3.94s
800:	learn: 0.0001571	total: 10.6s	remaining: 2.63s
900:	learn: 0.0001288	total: 12s	remaining: 1.31s
999:	learn: 0.0001128	total: 13.3s	remaining: 0us
Time taken: 1 min 39 secs
ROC AUC Score  0.8672765955003272
In [31]:
# float feature
feature_name = 'Amount'
dict_stats = model.calc_feature_statistics(df_Xtrain, ser_ytrain, feature_name, plot=True)

Feature Importance

In [32]:
# feature importance
df_imp = pd.DataFrame({'Feature': features,
                       'Importance': model.feature_importances_
                       }) 

df_imp.sort_values('Importance',ascending=False).style.background_gradient()
Out[32]:
Feature Importance
4 V4 10.0278
1 V1 7.65788
14 V14 6.21339
26 V26 6.16848
29 Amount 5.21334
16 V16 5.061
2 V2 4.97336
8 V8 4.86713
7 V7 3.80879
0 Time 3.64571
9 V9 3.49311
6 V6 3.32542
24 V24 3.15045
19 V19 3.03403
12 V12 2.97879
17 V17 2.97088
13 V13 2.90739
3 V3 2.40602
10 V10 2.34503
21 V21 2.25003
25 V25 2.19424
15 V15 2.16149
18 V18 2.02215
27 V27 1.95333
28 V28 1.37484
20 V20 0.908247
11 V11 0.849226
22 V22 0.74941
23 V23 0.708907
5 V5 0.580155
In [33]:
def plot_feature_imp_catboost(model_catboost,features):
    """Plot the feature importance horizontal bar plot.
    
    """

    df_imp = pd.DataFrame({'Feature': model.feature_names_,
                        'Importance': model.feature_importances_
                        }) 

    df_imp = df_imp.sort_values('Importance').set_index('Feature')
    ax = df_imp.plot.barh(figsize=(12,8))

    plt.grid(True)
    plt.title('Feature Importance',fontsize=14)
    ax.get_legend().remove()

    for p in ax.patches:
        x = p.get_width()
        y = p.get_y()
        text = '{:.2f}'.format(p.get_width())
        ax.text(x, y,text,fontsize=15,color='indigo')

    plt.show()

plot_feature_imp_catboost(model, features)
In [34]:
df_fimp = model.get_feature_importance(prettified=True)
df_fimp.head()
Out[34]:
Feature Id Importances
0 V4 10.027790
1 V1 7.657877
2 V14 6.213393
3 V26 6.168477
4 Amount 5.213338
In [35]:
plt.figure(figsize=(12,8))
ax = sns.barplot(x=df_fimp.columns[1], y=df_fimp.columns[0], data=df_fimp);

for p in ax.patches:
    x = p.get_width()
    y = p.get_y()
    text = '{:.2f}'.format(p.get_width())
    ax.text(x, y,text,fontsize=15,color='indigo',va='top',ha='left')

catboost using Pool

In [36]:
from catboost import CatBoost, Pool
In [37]:
# help(CatBoost)
In [38]:
cat_features = [] # take it empty for the moment
dtrain = Pool(df_Xtrain, ser_ytrain, cat_features=cat_features)
dvalid = Pool(df_Xvalid, ser_yvalid, cat_features=cat_features)
dtest = Pool(df_Xtest, ser_ytest, cat_features=cat_features)
In [39]:
params_cat = {'iterations': 100,
          'random_seed': 0,
          'eval_metric':'AUC',
          'loss_function':'Logloss',
          'cat_features': [],
          'ignored_features': [],
          'early_stopping_rounds': 200,
          'verbose': 200,
          }

bst_cat = CatBoost(params=params_cat)

bst_cat.fit(dtrain,           
            eval_set=(df_Xvalid, ser_yvalid), 
          use_best_model=True,
          plot=True);

print(bst_cat.eval_metrics(dtest, ['AUC'])['AUC'][-1])
Learning rate set to 0.3611
0:	test: 0.9164238	best: 0.9164238 (0)	total: 73.6ms	remaining: 7.29s
99:	test: 0.9808154	best: 0.9840566 (35)	total: 3.72s	remaining: 0us

bestTest = 0.9840565878
bestIteration = 35

Shrink model to first 36 iterations.
0.9782066843338348

Cross Validation

cv(pool=None, params=None, dtrain=None, iterations=None, 
num_boost_round=None, fold_count=None, nfold=None, inverted=False,
partition_random_seed=0, seed=None, shuffle=True, logging_level=None,
stratified=None, as_pandas=True, metric_period=None, verbose=None,
verbose_eval=None, plot=False, early_stopping_rounds=None,
save_snapshot=None, snapshot_file=None,
snapshot_interval=None, folds=None, type='Classical')
In [40]:
params = {'iterations': 100, 'verbose': False,
          'random_seed': 0,
          'loss_function':'Logloss',
          'eval_metric':'AUC',
          }

df_scores = catboost.cv(dtrain,
            params,
            fold_count=2,
            verbose=100,
            shuffle=True,
            stratified=True,
            plot="True") # plot does not work in google colab
0:	test: 0.9182109	best: 0.9182109 (0)	total: 191ms	remaining: 18.9s
99:	test: 0.9772304	best: 0.9784382 (54)	total: 6.52s	remaining: 0us
In [41]:
print(df_scores.columns)
df_scores.head()
Index(['iterations', 'test-AUC-mean', 'test-AUC-std', 'test-Logloss-mean',
       'test-Logloss-std', 'train-Logloss-mean', 'train-Logloss-std'],
      dtype='object')
Out[41]:
iterations test-AUC-mean test-AUC-std test-Logloss-mean test-Logloss-std train-Logloss-mean train-Logloss-std
0 0 0.918211 0.015632 0.585840 0.001246 0.585823 0.001171
1 1 0.922383 0.027860 0.500689 0.002353 0.500659 0.002239
2 2 0.933871 0.022411 0.425035 0.003157 0.425024 0.003205
3 3 0.928061 0.020897 0.365778 0.003360 0.365737 0.003457
4 4 0.939572 0.017085 0.310018 0.004005 0.309959 0.003970
In [42]:
fig, ax = plt.subplots(figsize=(12,8))
sns.lineplot(x='iterations',y='train-Logloss-mean',data=df_scores,ax=ax,color='r')
sns.lineplot(x='iterations',y='test-Logloss-mean',data=df_scores,ax=ax,color='b',alpha=0.2,linewidth=5,linestyle='--')

plt.show()

HPO (Hyperparameter Optimization)

In general, we should first tune model complexity and then tune convergence.

Model complexity: depth, etc. Convergence: learning rate.

Parameters:

  • learning_rate: step size shrinkage used to prevent overfitting. Range is [0,1]
  • depth: determines how deeply each tree is allowed to grow during any boosting round.
  • subsample: percentage of samples used per tree. Low value can lead to underfitting.
  • colsample_bytree: percentage of features used per tree. High value can lead to overfitting.
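The complexity-first, convergence-second order above can be sketched as a simple two-stage search; this uses scikit-learn's GradientBoostingClassifier on toy data purely for illustration (the notebook itself tunes CatBoost):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=400, random_state=0)
Xtr, Xvd, ytr, yvd = train_test_split(X, y, test_size=0.25, random_state=0)

def valid_auc(**kw):
    """Validation AUC for a boosted model with the given hyperparameters."""
    model = GradientBoostingClassifier(n_estimators=50, random_state=0, **kw).fit(Xtr, ytr)
    return roc_auc_score(yvd, model.predict_proba(Xvd)[:, 1])

# stage 1: tune model complexity (tree depth) with other settings fixed
best_depth = max([2, 4, 6], key=lambda d: valid_auc(max_depth=d))

# stage 2: with depth fixed, tune convergence (learning rate)
best_lr = max([0.03, 0.1, 0.3],
              key=lambda lr: valid_auc(max_depth=best_depth, learning_rate=lr))

print(best_depth, best_lr)
```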

Baseline model

In [43]:
model = CatBoostClassifier(verbose=100,random_state=SEED)

model.fit(df_Xtrain, ytr)

ypreds = model.predict(df_Xtest)

cm = confusion_matrix(ytest, ypreds)
print(cm)
Learning rate set to 0.073099
0:	learn: 0.4520944	total: 46.6ms	remaining: 46.5s
100:	learn: 0.0016060	total: 3.47s	remaining: 30.9s
200:	learn: 0.0011228	total: 7.42s	remaining: 29.5s
300:	learn: 0.0008612	total: 12s	remaining: 28s
400:	learn: 0.0006628	total: 15.8s	remaining: 23.7s
500:	learn: 0.0005038	total: 21.6s	remaining: 21.5s
600:	learn: 0.0003722	total: 25.4s	remaining: 16.9s
700:	learn: 0.0002731	total: 29.3s	remaining: 12.5s
800:	learn: 0.0001932	total: 33.1s	remaining: 8.22s
900:	learn: 0.0001492	total: 37.2s	remaining: 4.09s
999:	learn: 0.0001250	total: 41.3s	remaining: 0us
[[56863     1]
 [   22    76]]
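Reading the matrix above: the second row is the fraud class, so the business question ('how many frauds are correctly classified?') is answered by the bottom-right cell. A small sketch using the numbers printed above:

```python
import numpy as np

# confusion matrix printed above: rows = true class, cols = predicted class
cm = np.array([[56863, 1],
               [   22, 76]])

tn, fp, fn, tp = cm.ravel()

print(f'Frauds correctly classified: {tp} of {tp + fn}')  # 76 of 98
print(f'Recall: {tp / (tp + fn):.4f}')                    # 0.7755
```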

Using Early Stopping from Validation Set

In [44]:
params = dict(verbose=500,
              random_state=0,
              iterations=3_000,
              eval_metric='AUC',
              cat_features = [],
              early_stopping_rounds=200,
            )


model = catboost.CatBoostClassifier(**params)

model.fit(df_Xtrain, ytrain, 
          eval_set=(df_Xvalid, yvalid), 
          use_best_model=True, 
          plot=False
         );
Learning rate set to 0.084221
0:	test: 0.9164238	best: 0.9164238 (0)	total: 43.4ms	remaining: 2m 10s
Stopped by overfitting detector  (200 iterations wait)

bestTest = 0.9809169354
bestIteration = 43

Shrink model to first 44 iterations.
In [45]:
time_start = time.time()


model = CatBoostClassifier(verbose=False,random_state=0,iterations=50)
model.fit(df_Xtrain, ser_ytrain)

ypreds = model.predict(df_Xtest)

cm = confusion_matrix(ytest, ypreds)
error = cm[0,1] + cm[1,0]

time_taken = time.time() - time_start
print('Time taken: {:.0f} min {:.0f} secs'.format(*divmod(time_taken,60)))

print('Error count from confusion matrix:', error)

# using 50 iterations is worse; use the previous 1000.
Time taken: 0 min 3 secs
Error count from confusion matrix: 29
In [46]:
for n in [6]: # default depth = 6

    model = CatBoostClassifier(verbose=False,random_state=0,
                              iterations=1_000,
                              depth=n,
                              )
    model.fit(Xtr, ytr)
    ypreds = model.predict(Xtx)
    cm = confusion_matrix(ytest, ypreds)
    error = cm[0,1] + cm[1,0]
    print(f'Confusion matrix error count = {error} for n = {n}')
Confusion matrix error count = 24 for n = 6

Try your luck with different random states

In [47]:
for n in [0]: 

    model = CatBoostClassifier(verbose=False,random_state=n,
                               depth=6,
                              iterations=1_000,
                              )
    model.fit(Xtr, ytr)
    ypreds = model.predict(Xtx)
    cm = confusion_matrix(ytest, ypreds)
    error = cm[0,1] + cm[1,0]
    print(f'Confusion matrix error count = {error} for n = {n}')
Confusion matrix error count = 24 for n = 0

HPO (Hyperparameter Optimization) with Optuna

In [48]:
import optuna
optuna.logging.set_verbosity(optuna.logging.WARNING) # use INFO to see progress
In [49]:
def objective(trial):

    params_cat_optuna = {
        'objective': trial.suggest_categorical('objective', ['Logloss', 'CrossEntropy']),
        'colsample_bylevel': trial.suggest_uniform('colsample_bylevel', 0.01, 0.1),
        'depth': trial.suggest_int('depth', 1, 12),
        'boosting_type': trial.suggest_categorical('boosting_type', ['Ordered', 'Plain']),
        'bootstrap_type': trial.suggest_categorical('bootstrap_type',
                                                    ['Bayesian', 'Bernoulli', 'MVS']),
        'used_ram_limit': '3gb'
    }

    # update parameters
    if params_cat_optuna['bootstrap_type'] == 'Bayesian':
        params_cat_optuna['bagging_temperature'] = trial.suggest_float('bagging_temperature', 0, 10)
    elif params_cat_optuna['bootstrap_type'] == 'Bernoulli':
        params_cat_optuna['subsample'] = trial.suggest_float('subsample', 0.1, 1)
        
    # fit the model
    model = CatBoostClassifier(random_state=SEED,**params_cat_optuna)
    model.fit(df_Xtrain, ser_ytrain,
            eval_set=[(df_Xvalid, ser_yvalid)],
            verbose=0,
            early_stopping_rounds=100)
    
    ypreds = model.predict(df_Xvalid)
    ypreds = np.rint(ypreds)
    score = roc_auc_score(ser_yvalid.to_numpy().ravel(),
                              ypreds)
    return score
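The objective above defines a conditional search space: `bagging_temperature` exists only when `bootstrap_type == 'Bayesian'`, and `subsample` only for `'Bernoulli'`. Optuna's define-by-run API handles this natively; as a plain-`random` sketch of the same conditional structure (illustration only, not Optuna's TPE sampler):

```python
import random

def sample_params(rng):
    """Sample one CatBoost parameter set with conditional keys."""
    params = {
        'objective': rng.choice(['Logloss', 'CrossEntropy']),
        'colsample_bylevel': rng.uniform(0.01, 0.1),
        'depth': rng.randint(1, 12),  # inclusive, like suggest_int
        'boosting_type': rng.choice(['Ordered', 'Plain']),
        'bootstrap_type': rng.choice(['Bayesian', 'Bernoulli', 'MVS']),
    }
    # conditional parameters, mirroring the objective above
    if params['bootstrap_type'] == 'Bayesian':
        params['bagging_temperature'] = rng.uniform(0, 10)
    elif params['bootstrap_type'] == 'Bernoulli':
        params['subsample'] = rng.uniform(0.1, 1)
    return params

rng = random.Random(0)
print(sample_params(rng))
```

The point of the conditional block is that a trial never carries a parameter its `bootstrap_type` cannot use, which keeps the sampler's history clean.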
In [50]:
# NOTE: there is inherent non-determinism in optuna hyperparameter selection
#       we may not get the same hyperparameters when run twice.


sampler = optuna.samplers.TPESampler(seed=SEED)
N_TRIALS = 1 # make it large

study = optuna.create_study(direction='maximize',
                            sampler=sampler,
                            study_name='cat_optuna',
                            storage='sqlite:///cat_optuna_fraud_detection.db',
                            load_if_exists=True)

study.optimize(objective, n_trials=N_TRIALS,timeout=600)
In [51]:
# Resume from last time
sampler = optuna.samplers.TPESampler(seed=SEED)
N_TRIALS = 1 # make it large

study = optuna.create_study(direction='maximize',
                            sampler=sampler,
                            study_name='cat_optuna',
                            storage='sqlite:///cat_optuna_fraud_detection.db',
                            load_if_exists=True)

study.optimize(objective, n_trials=N_TRIALS)
In [52]:
print(f'Number of finished trials: {len(study.trials)}')

# best trial
best_trial = study.best_trial

# best params
params_best = study.best_trial.params
params_best
Number of finished trials: 10
Out[52]:
{'bagging_temperature': 1.4860484007536512,
 'boosting_type': 'Plain',
 'bootstrap_type': 'Bayesian',
 'colsample_bylevel': 0.07040400702545975,
 'depth': 8,
 'objective': 'Logloss'}
In [53]:
# time
time_start = time.time()

model_name = 'catboost'
desc = 'grid search optuna'
Xtr = df_Xtrain_orig
ytr = ser_ytrain_orig.to_numpy().ravel()
Xtx = df_Xtest
ytx = ser_ytest.to_numpy().ravel()
Xvd = df_Xvalid
yvd = ser_yvalid.to_numpy().ravel()


# use best model
params_best =  study.best_trial.params

clf_cat = CatBoostClassifier(random_state=SEED, verbose=False)
clf_cat.set_params(**params_best)

# fit and save the model
clf_cat.fit(Xtr, ytr)
joblib.dump(clf_cat, '../outputs/clf_cat_grid_search_optuna.pkl')

# load the saved model
clf_cat = joblib.load('../outputs/clf_cat_grid_search_optuna.pkl')

# predictions
skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=SEED)
ypreds_cv = cross_val_predict(clf_cat, Xtx, ytx, cv=skf)
ypreds = ypreds_cv

# model evaluation
average = 'binary'
row_eval = [model_name,desc, 
            accuracy_score(ytx, ypreds),
            precision_score(ytx, ypreds, average=average),
            recall_score(ytx, ypreds, average=average),
            f1_score(ytx, ypreds, average=average),
            roc_auc_score(ytx, ypreds),
            ]

df_eval.loc[len(df_eval)] = row_eval
df_eval = df_eval.drop_duplicates()
time_taken = time.time() - time_start
print('Time taken: {:.0f} min {:.0f} secs'.format(*divmod(time_taken,60)))
display(df_eval)
Time taken: 1 min 51 secs
Model Description Accuracy Precision Recall F1 AUC
0 catboost default, seed=100 0.999403 0.999403 0.999403 0.999403 0.857090
1 catboost grid search optuna 0.999368 0.930556 0.683673 0.788235 0.841793
In [54]:
df_eval.sort_values('Recall',ascending=False).style.background_gradient(subset='Recall')
Out[54]:
Model Description Accuracy Precision Recall F1 AUC
0 catboost default, seed=100 0.999403 0.999403 0.999403 0.999403 0.85709
1 catboost grid search optuna 0.999368 0.930556 0.683673 0.788235 0.841793
In [55]:
cm = confusion_matrix(ytest,ypreds)
vals = cm.ravel()

cm
Out[55]:
array([[56859,     5],
       [   31,    67]])
In [56]:
print('Catboost Grid Search Results')
print('-'*25)
tn, fp, fn, tp = vals  # confusion matrix counts: [TN, FP, FN, TP]
print('Total Frauds: ', fn + tp)
print('Missed Frauds: ', fn)
print('Missed Percent: ', round(fn*100/(fn+tp),2),'%')
Catboost Grid Search Results
-------------------------
Total Frauds:  98
Missed Frauds:  31
Missed Percent:  31.63 %
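Since the stated business metric is Recall, it helps to read it straight off the confusion matrix: recall = TP / (TP + FN). Using the counts printed above:

```python
import numpy as np

# confusion matrix from the cell above: [[TN, FP], [FN, TP]]
cm = np.array([[56859,  5],
               [   31, 67]])
tn, fp, fn, tp = cm.ravel()

recall = tp / (tp + fn)            # fraction of frauds caught
missed_pct = fn * 100 / (fn + tp)  # fraction of frauds missed
print(f'Recall = {recall:.4f}')    # 67/98
print(f'Missed = {missed_pct:.2f}%')
```

This matches the Recall of 0.683673 reported for the Optuna model in the evaluation table above.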
In [57]:
from bhishan.bhishan import plotly_binary_clf_evaluation

yprobs = model.predict_proba(df_Xtest)
yprobs = yprobs[:,0] # column 0 = P(not fraud); use column 1 for P(fraud)
plotly_binary_clf_evaluation('clf_lgb_optuna',model,ytx,ypreds,yprobs,df)
In [58]:
yprobs
Out[58]:
array([0.99989527, 0.99999168, 0.99999445, ..., 0.99998248, 0.9999996 ,
       0.99999638])

Best Model

In [59]:
model = CatBoostClassifier(verbose=False,random_state=100,
                            depth=6,
                            iterations=1_000,
                            )
model.fit(Xtr, ytr)
ypreds = model.predict(Xtx)
cm = confusion_matrix(ytest, ypreds)
error = cm[0,1] + cm[1,0]
print(f'Confusion matrix error count = {error}')
Confusion matrix error count = 21
In [60]:
print(cm)
[[56864     0]
 [   21    77]]

Model Interpretation

In [61]:
pd.concat([df_Xtrain.head(2), df_Xtest.head(2)])  # DataFrame.append is removed in pandas 2.0
Out[61]:
Time V1 V2 V3 V4 V5 V6 V7 V8 V9 ... V20 V21 V22 V23 V24 V25 V26 V27 V28 Amount
35574 38177.0 1.430419 -0.718078 0.364706 -0.744257 -0.556090 0.698948 -0.949852 0.131008 -0.314353 ... 0.158424 0.042013 0.429576 -0.301931 -0.933773 0.840490 -0.027776 0.044688 -0.007522 0.20
46862 42959.0 -2.425523 -1.790293 2.522139 0.581141 0.918453 0.594426 0.224541 0.373885 -0.168411 ... 0.984535 0.538438 0.877560 0.590595 -0.293545 0.524022 -0.328189 -0.205285 -0.109163 300.00
248750 154078.0 0.046622 1.529678 -0.453615 1.282569 1.110333 -0.882716 1.046420 -0.117121 -0.679897 ... 0.240559 -0.338472 -0.839547 0.066527 0.836447 0.076790 -0.775158 0.261012 0.058359 18.70
161573 114332.0 0.145870 0.107484 0.755127 -0.995936 1.159107 2.113961 0.036200 0.471777 0.627622 ... -0.107332 0.297644 1.285809 -0.140560 -0.910706 -0.449729 -0.235203 -0.036910 -0.227111 9.99

4 rows × 30 columns

Model interpretation using eli5

In [62]:
import eli5

eli5.show_weights(model)
Out[62]:
Weight Feature
0.0981 V1
0.0910 V4
0.0753 V14
0.0574 V26
0.0495 Amount
0.0454 V12
0.0412 V2
0.0369 Time
0.0351 V15
0.0329 V10
0.0309 V19
0.0285 V25
0.0283 V20
0.0282 V22
0.0273 V13
0.0260 V8
0.0257 V11
0.0247 V3
0.0244 V7
0.0238 V6
… 10 more …
In [63]:
from eli5.sklearn import PermutationImportance

feature_names = df_Xtrain.columns.tolist()

perm = PermutationImportance(model).fit(df_Xtest, ytx)
eli5.show_weights(perm, feature_names=feature_names)
Out[63]:
Weight Feature
0.0009 ± 0.0001 V14
0.0002 ± 0.0001 V4
0.0002 ± 0.0000 V10
0.0002 ± 0.0000 V26
0.0001 ± 0.0001 Amount
0.0001 ± 0.0000 V28
0.0001 ± 0.0000 V12
0.0000 ± 0.0000 V27
0.0000 ± 0.0000 Time
0.0000 ± 0.0000 V16
0.0000 ± 0.0000 V17
0.0000 ± 0.0000 V6
0.0000 ± 0.0000 V11
0.0000 ± 0.0000 V3
0.0000 ± 0.0000 V7
0.0000 ± 0.0000 V1
0.0000 ± 0.0000 V18
0.0000 ± 0.0000 V22
0.0000 ± 0.0000 V9
0.0000 ± 0.0000 V19
… 10 more …
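eli5's `PermutationImportance` shuffles one column at a time and reports the resulting score drop. scikit-learn ships the same idea as `sklearn.inspection.permutation_importance`; a self-contained sketch on synthetic data (a stand-in, not the notebook's fraud set):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=1000, n_features=6, n_informative=2,
                           random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X, y)

# shuffle each feature 10 times; importance = mean accuracy drop
result = permutation_importance(clf, X, y, n_repeats=10, random_state=0)
for i in np.argsort(result.importances_mean)[::-1]:
    print(f'feature_{i}: {result.importances_mean[i]:.4f} '
          f'± {result.importances_std[i]:.4f}')
```

As done above with `df_Xtest`, computing permutation importance on held-out data avoids the optimistic scores you can get on training data.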

Model interpretation using lime

In [64]:
import lime
import lime.lime_tabular
In [65]:
idx = 0
example = df_Xtest.iloc[idx]
answer = ser_ytest.iloc[idx]
feature_names = df_Xtest.columns.tolist()

prediction = model.predict(example.to_numpy().reshape(1, -1))  # single-row 2D array


print(f'answer     = {answer}')
print('prediction = ', prediction[0])
print()
print(example)
print(feature_names)
answer     = 0
prediction =  0.0

Time      154078.000000
V1             0.046622
V2             1.529678
V3            -0.453615
V4             1.282569
V5             1.110333
V6            -0.882716
V7             1.046420
V8            -0.117121
V9            -0.679897
V10           -0.923709
V11            0.371519
V12           -0.000047
V13            0.512255
V14           -2.091762
V15            0.786796
V16            0.159652
V17            1.706939
V18            0.458922
V19            0.037665
V20            0.240559
V21           -0.338472
V22           -0.839547
V23            0.066527
V24            0.836447
V25            0.076790
V26           -0.775158
V27            0.261012
V28            0.058359
Amount        18.700000
Name: 248750, dtype: float64
['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount']
In [66]:
import lime
import lime.lime_tabular

categorical_features = []
categorical_features_idx = [df_Xtrain.columns.get_loc(col) for col in categorical_features]


explainer = lime.lime_tabular.LimeTabularExplainer(df_Xtrain.to_numpy(), 
               feature_names=feature_names, 
               class_names=['Not-fraud','Fraud'], 
               categorical_features=categorical_features_idx, 
               mode='classification')

exp = explainer.explain_instance(example.to_numpy(), model.predict_proba, num_features=8)
exp.show_in_notebook(show_table=True)
In [67]:
exp.as_pyplot_figure(); # use semicolon
In [68]:
import shap
shap.initjs()
In [69]:
model = CatBoostClassifier(verbose=100,random_state=100)

model.fit(df_Xtrain, ytrain)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(df_Xtest)
Learning rate set to 0.073099
0:	learn: 0.4520944	total: 104ms	remaining: 1m 43s
100:	learn: 0.0016060	total: 6.54s	remaining: 58.2s
200:	learn: 0.0011228	total: 12.1s	remaining: 48.2s
300:	learn: 0.0008612	total: 17.6s	remaining: 40.8s
400:	learn: 0.0006628	total: 23.6s	remaining: 35.3s
500:	learn: 0.0005038	total: 29s	remaining: 28.9s
600:	learn: 0.0003722	total: 36.2s	remaining: 24s
700:	learn: 0.0002731	total: 41.8s	remaining: 17.8s
800:	learn: 0.0001932	total: 47.4s	remaining: 11.8s
900:	learn: 0.0001492	total: 53.6s	remaining: 5.89s
999:	learn: 0.0001250	total: 59.2s	remaining: 0us
/Users/poudel/miniconda3/envs/dataSc/lib/python3.7/site-packages/shap/explainers/tree.py:104: UserWarning:

Setting feature_perturbation = "tree_path_dependent" because no background data was given.

In [70]:
df_Xtest.head(1)
Out[70]:
Time V1 V2 V3 V4 V5 V6 V7 V8 V9 ... V20 V21 V22 V23 V24 V25 V26 V27 V28 Amount
248750 154078.0 0.046622 1.529678 -0.453615 1.282569 1.110333 -0.882716 1.04642 -0.117121 -0.679897 ... 0.240559 -0.338472 -0.839547 0.066527 0.836447 0.07679 -0.775158 0.261012 0.058359 18.7

1 rows × 30 columns

In [71]:
df_Xtest.head(1)['V15 V18 V3 V24 V1 V8 V4 V14 V2 V6 V9 V20'.split()].round(4)
Out[71]:
V15 V18 V3 V24 V1 V8 V4 V14 V2 V6 V9 V20
248750 0.7868 0.4589 -0.4536 0.8364 0.0466 -0.1171 1.2826 -2.0918 1.5297 -0.8827 -0.6799 0.2406
In [72]:
# Look only first row of test data
# use matplotlib=True to avoid Javascript
idx = 0
shap.force_plot(explainer.expected_value,
                shap_values[idx,:],
                df_Xtest.iloc[idx,:],
                matplotlib=False,
                text_rotation=90)

# for this row, the model's raw output (log-odds) is about -9.33
# red features push the score higher (towards fraud)
# blue features push it lower.
Out[72]:
Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.
In [73]:
NUM = 100
shap.force_plot(explainer.expected_value, shap_values[:NUM,:],
                df_Xtest.iloc[:NUM,:],matplotlib=False)
Out[73]:
Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.
In [74]:
shap.summary_plot(shap_values, df_Xtest)
In [75]:
shap.summary_plot(shap_values, df_Xtest, plot_type='bar')
In [76]:
shap.dependence_plot("Amount", shap_values, df_Xtest)
In [77]:
shap.dependence_plot(ind='Time', interaction_index='Amount',
                     shap_values=shap_values, 
                     features=df_Xtest,  
                     display_features=df_Xtest)
In [78]:
notebook_end_time = time.time()
time_taken = time.time() - notebook_start_time
h,m = divmod(time_taken,60*60)
print('Time taken to run whole notebook: {:.0f} hr {:.0f} min {:.0f} secs'.format(h, *divmod(m,60)))
Time taken to run whole notebook: 0 hr 12 min 28 secs